A Framework for Multilingual Searching and Meta-information Extraction
Authors
Abstract
Due in large part to the popularity and global nature of the Web, multi-lingual issues in computing are finally beginning to attract serious attention from users and developers alike. At the Software Labs in NTT, we are involved in a project that confronts multi-lingual issues in a big way: we are building software designed to self-configure a global distributed search infrastructure. This paper describes how we have architected our system to handle multi-lingual issues ranging from character encoding translation to language detection to term weighting. It can serve both as a case study in the design of multi-lingual distributed systems and as a general tutorial on multi-lingual issues. It also presents some surprising conclusions, such as the finding that the popular Unicode international character encoding is not the best one to use in our environment.
Similar resources
A Web Smart Space Framework for Information Mining: A base for Intelligent Search Engines
A web smart space is an intelligent environment that has the additional capability of searching for information smartly and efficiently. New advancements such as dynamic web content generation have increased the size of web repositories. Among the many modern software analysis requirements, one is to search for information in a given repository. But useful information extraction is a troublesome hitch...
Terminology Retrieval: Towards a Synergy between Thesaurus and Free Text Searching
Multilingual Information Retrieval usually forces a choice between free-text indexing and indexing by means of a multilingual thesaurus. However, since they share the same objectives, synergy between both approaches is possible. This paper shows a retrieval framework that makes use of terminological information in free-text indexing. The Automatic Terminology Extraction task, which is used for thes...
ATLAS – The Multilingual Language Processing
This paper presents the ATLAS platform, a multilingual language processing framework integrating a common set of linguistic tools for a group of European languages (less-resourced: Bulgarian, Croatian, Greek, Polish and Romanian, together with English and German as reference languages). State-of-the-art NLP functionality offered by the platform allows for multilingual annotation of texts on low...
Modern Multilingual and Cross-lingual Information Access Technologies
In this chapter, we describe state-of-the-art cross-lingual and multilingual strategies and their related areas. In particular, we show a WWW-based information system called MIETTA, which allows uniform and multilingual access to heterogeneous data sources in the tourism domain. The design of the search engine is based on a new cross-lingual framework. The framework integrates a cross-lingu...
Web Information Mining Framework using XML Based Knowledge Representation Engine
Information or knowledge representation is one of the principal elements of artificial intelligence based applications. Conventionally, predicate logic is used in various languages to represent the required knowledge. The recent fashion in knowledge representation languages is to use XML as the low-level syntax. XML is the standard representation of multi-facet data. This generic representation...